7BUIS008W Data Mining & Machine Learning - Coursework 1

Andrew Keats

22 November 2017

Table of Contents

Question 1: White Wine clustering

Starting off

You need to conduct the k-means clustering analysis of the white wine sheet. Find the ideal number of clusters (please justify your answer). Choose the best two possible numbers of clusters and perform the k-means algorithm for both candidates. Validate which clustering test is more accurate. For the winning test, get the mean of each attribute of each group. Before conducting the k-means, please investigate if you need to add in your code any pre-processing task (justify your answer). Write a code in R Studio to address all the above issues. In your report, check the consistency of those produced clusters, with information obtained from column 12.

In the White Wine dataset provided, column 12 is labelled Quality; this is a qualitative value assigned by a human through the subjective means of tasting. Essentially, by clustering against all variables apart from Quality and then comparing against this variable, we are looking for some correlation between the combined physicochemical variables and the subjective quality of the wine.

Firstly we need to load the data…

#load the libraries used throughout this analysis
library(readxl)     #read_excel
library(NbClust)    #NbClust
library(fpc)        #plotcluster
library(flexclust)  #randIndex

#going to import the Excel spreadsheet WhiteWine dataset
wine.raw <- read_excel("../data/Whitewine.xlsx")

Here’s a glance at the dataset

head(wine.raw)
## # A tibble: 6 x 12
##   `fixed acidity` `volatile acidity` `citric acid` `residual sugar`
##             <dbl>              <dbl>         <dbl>            <dbl>
## 1             7.0               0.27          0.36             20.7
## 2             6.3               0.30          0.34              1.6
## 3             8.1               0.28          0.40              6.9
## 4             7.2               0.23          0.32              8.5
## 5             7.2               0.23          0.32              8.5
## 6             8.1               0.28          0.40              6.9
## # ... with 8 more variables: chlorides <dbl>, `free sulfur dioxide` <dbl>,
## #   `total sulfur dioxide` <dbl>, density <dbl>, pH <dbl>,
## #   sulphates <dbl>, alcohol <dbl>, quality <dbl>
str(wine.raw)
## Classes 'tbl_df', 'tbl' and 'data.frame':    4898 obs. of  12 variables:
##  $ fixed acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free sulfur dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total sulfur dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : num  6 6 6 6 6 6 6 6 6 6 ...

We want to scale the data so that all attributes can be compared more easily. First of all, let's split our data into two tables: one with all the physicochemical attributes of the wine and the other with just the humanly perceived quality.

wine.all_but_q <- wine.raw[1:11]
wine.q <- wine.raw$quality

#Wine properties
str(wine.all_but_q)
## Classes 'tbl_df', 'tbl' and 'data.frame':    4898 obs. of  11 variables:
##  $ fixed acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free sulfur dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total sulfur dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
#Wine quality values
str(wine.q)
##  num [1:4898] 6 6 6 6 6 6 6 6 6 6 ...

Now we scale the data

wine.scaled <- as.data.frame(scale(wine.all_but_q))

#Summary of scaled wine data
summary(wine.scaled)
##  fixed acidity      volatile acidity   citric acid      residual sugar   
##  Min.   :-3.61998   Min.   :-1.9668   Min.   :-2.7615   Min.   :-1.1418  
##  1st Qu.:-0.65743   1st Qu.:-0.6770   1st Qu.:-0.5304   1st Qu.:-0.9250  
##  Median :-0.06492   Median :-0.1810   Median :-0.1173   Median :-0.2349  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.52758   3rd Qu.: 0.4143   3rd Qu.: 0.4612   3rd Qu.: 0.6917  
##  Max.   : 8.70422   Max.   : 8.1528   Max.   :10.9553   Max.   :11.7129  
##    chlorides       free sulfur dioxide total sulfur dioxide
##  Min.   :-1.6831   Min.   :-1.95848    Min.   :-3.0439     
##  1st Qu.:-0.4473   1st Qu.:-0.72370    1st Qu.:-0.7144     
##  Median :-0.1269   Median :-0.07691    Median :-0.1026     
##  Mean   : 0.0000   Mean   : 0.00000    Mean   : 0.0000     
##  3rd Qu.: 0.1935   3rd Qu.: 0.62867    3rd Qu.: 0.6739     
##  Max.   :13.7417   Max.   :14.91679    Max.   : 7.0977     
##     density               pH             sulphates      
##  Min.   :-2.31280   Min.   :-3.10109   Min.   :-2.3645  
##  1st Qu.:-0.77063   1st Qu.:-0.65077   1st Qu.:-0.6996  
##  Median :-0.09608   Median :-0.05475   Median :-0.1739  
##  Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.69298   3rd Qu.: 0.60750   3rd Qu.: 0.5271  
##  Max.   :15.02976   Max.   : 4.18365   Max.   : 5.1711  
##     alcohol        
##  Min.   :-2.04309  
##  1st Qu.:-0.82419  
##  Median :-0.09285  
##  Mean   : 0.00000  
##  3rd Qu.: 0.71974  
##  Max.   : 2.99502
boxplot(wine.scaled, main="Looking at the data graphically", xlab="Wine Attributes", ylab="Scaled values") 

We can see from these box-plots that some attributes have clear outliers that suggest erroneous data rather than natural extremes, so it is worth cleansing the data a little by removing them. For example, Alcohol, in the rightmost column, has the tight boundaries we would expect from wine. Other attributes tell a different story. Chlorides has many values above the upper quartile and a large distance between its min and max values, but there is a gradient to them that suggests a normal distribution with long tails. In contrast, Residual Sugar, Free Sulfur Dioxide and Density not only have relatively large min-max distances, but their uppermost values sit a relatively large distance from their nearest neighbours.
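As a quick sanity check on this visual reading, we can count how many values the boxplot machinery itself flags as outliers for each attribute (`boxplot.stats` is the same routine `boxplot` uses internally); this is illustrative only and was not part of the original run.

```r
#Illustrative: count boxplot-flagged outliers per scaled attribute
sapply(wine.scaled, function(col) length(boxplot.stats(col)$out))
```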

Below are density plots demonstrating the difference between attributes that don't appear to have outliers and those that do.

plot(density(wine.scaled$`alcohol`))

plot(density(wine.scaled$`chlorides`))

plot(density(wine.scaled$`free sulfur dioxide`))

plot(density(wine.scaled$`density`))

In order to work out which attributes have outliers worth removing, I've taken a heuristic approach: looking at the distance between each attribute's uppermost value and its nearest neighbour.

#Create a list to populate with our tail neighbour distances
tail_deltas <- c()

for (attrib in wine.scaled) {
 #get the last two values
 data_tails <- tail(sort(attrib),2)
 #push the delta on to our list 
 tail_deltas <- c(tail_deltas, diff(data_tails))
}

#grab our attribute names to include in our new table/frame
attributes <- names(wine.scaled)

#make a new data frame pairing each attribute with its tail-neighbour distance
dataframe <- data.frame(attributes = attributes, tail_neighbour_d=tail_deltas)

#get the order for the nearest neighbour starting with the greatest distance and descending
neighbour_order <- order(dataframe$tail_neighbour_d, decreasing=TRUE)

#now apply the order to the frame
sorted_attributes_by_neighbour_d <- dataframe[ neighbour_order, ]
sorted_attributes_by_neighbour_d
##              attributes tail_neighbour_d
## 8               density        9.5890647
## 6   free sulfur dioxide        8.3788351
## 4        residual sugar        6.7428254
## 3           citric acid        3.5531375
## 1         fixed acidity        2.8440459
## 5             chlorides        2.0596881
## 7  total sulfur dioxide        1.7294905
## 2      volatile acidity        0.9425113
## 10            sulphates        0.1752452
## 11              alcohol        0.1218897
## 9                    pH        0.0662249

Given the findings, I think we can just consider the top five attributes in the above list as ones to cleanse for outliers. A lot of sources online warn against arbitrarily discarding outliers, because valid information may be lost when what you really want to remove is bad data.

To clarify, the attributes to be processed are:

- density
- free sulfur dioxide
- residual sugar
- citric acid
- fixed acidity

Boxplot has an outlier property ($out) that we can use to collect values we might want to remove, so this is the first option we will look at for cleansing the data.

wine.scaled_cleansed_bp <- wine.scaled[ !(wine.scaled$density %in% boxplot(wine.scaled$density, plot=FALSE)$out), ]
wine.scaled_cleansed_bp <- wine.scaled_cleansed_bp[ !(wine.scaled_cleansed_bp$`free sulfur dioxide` %in% boxplot(wine.scaled$`free sulfur dioxide`, plot=FALSE)$out), ]
wine.scaled_cleansed_bp <- wine.scaled_cleansed_bp[ !(wine.scaled_cleansed_bp$`residual sugar` %in% boxplot(wine.scaled_cleansed_bp$`residual sugar`, plot=FALSE)$out), ]
wine.scaled_cleansed_bp <- wine.scaled_cleansed_bp[ !(wine.scaled_cleansed_bp$`citric acid` %in% boxplot(wine.scaled_cleansed_bp$`citric acid`, plot=FALSE)$out), ]
wine.scaled_cleansed_bp <- wine.scaled_cleansed_bp[ !(wine.scaled_cleansed_bp$`fixed acidity` %in% boxplot(wine.scaled_cleansed_bp$`fixed acidity`, plot=FALSE)$out), ]

boxplot(wine.scaled_cleansed_bp, main="Looking at the cleansed data graphically", xlab="Wine Attributes", ylab="Scaled values") 

While this new dataset now has no values beyond the outermost quartile ranges, this is arguably too harsh a treatment. An alternative is to work with multiples of the interquartile range; what I have done is tweak the multiplier of the IQR until only the most extreme outliers were discarded. In the end, a value of 5 times the IQR worked well, picking off only values at the very tips of the tails.

#Get the top 5 variables with the highest outlier distance
worst_outliers <- head(sorted_attributes_by_neighbour_d$attributes, n=5)

wine.scaled_cleansed_iqr <- wine.scaled

# Create a variable to store the row id's to be removed
iqr_outliers <- c()
quartile_multiplier = 5

# Loop through the list of columns you specified
for(i in worst_outliers){

 # Get the Min/Max values
 max <- quantile(wine.scaled_cleansed_iqr[,i],0.75, na.rm=FALSE) + (IQR(wine.scaled_cleansed_iqr[,i], na.rm=FALSE) * quartile_multiplier )
 min <- quantile(wine.scaled_cleansed_iqr[,i],0.25, na.rm=FALSE) - (IQR(wine.scaled_cleansed_iqr[,i], na.rm=FALSE) * quartile_multiplier )
 
 # Get the id's using which
 idx <- which(wine.scaled_cleansed_iqr[,i] < min | wine.scaled_cleansed_iqr[,i] > max)
 
 # Output the number of outliers in each variable
 #print(paste(i, length(idx), sep=' - removing: '))
 
 # Append the outliers list
 iqr_outliers <- c(iqr_outliers, idx) 
}

# Sorting outliers
iqr_outliers <- sort(iqr_outliers)

# Remove the outliers
wine.scaled_cleansed_iqr <- wine.scaled_cleansed_iqr[-iqr_outliers,]

boxplot(wine.scaled_cleansed_iqr, main="Looking at the IQR cleansed data graphically", xlab="Wine Attributes", ylab="Scaled values") 

Now that the data looks a lot cleaner, it's time to start working with it to find the best clustering. To begin with, NbClust will be used to see if it produces anything useful.

number_of_clusters <- NbClust(wine.scaled_cleansed_iqr,
                min.nc=2, max.nc=15,
                method="kmeans")
## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 9 proposed 2 as the best number of clusters 
## * 3 proposed 3 as the best number of clusters 
## * 2 proposed 4 as the best number of clusters 
## * 3 proposed 5 as the best number of clusters 
## * 1 proposed 6 as the best number of clusters 
## * 1 proposed 8 as the best number of clusters 
## * 1 proposed 13 as the best number of clusters 
## * 3 proposed 14 as the best number of clusters 
## * 1 proposed 15 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  2 
##  
##  
## *******************************************************************

The following table displays the results recommending potential values for k

table(number_of_clusters$Best.n[1,])
## 
##  0  2  3  4  5  6  8 13 14 15 
##  2  9  3  2  3  1  1  1  3  1

The bar chart more easily conveys this.

barplot(table(number_of_clusters$Best.n[1,]), 
       xlab="Number of Clusters",
       ylab="Number of Criteria",
       main="Number of Clusters Chosen by 30 Criteria")

From the bar graph above we can see that there is a clear leader in the suggested number of clusters: k = 2. There are, however, other values worth exploring for comparison: 3, 5 and 14. To cross-check this result, we can plot the within-group sum of squared errors and look for a pronounced bend ('elbow') in the curve; wherever the most pronounced bend occurs is a contender for the value of k.

sse_list <- 0
for (i in 1:15){
 sse_list[i] <- sum(kmeans(wine.scaled_cleansed_iqr, centers=i)$withinss)
}
## Warning: did not converge in 10 iterations
plot(1:15,
 sse_list,
 type="b",
 xlab="Number of Clusters",
 ylab="Within groups sum of squares")

The Sum of Square Errors plot partially backs up the NbClust results, as there is an 'elbow' in the line at 2 clusters. Having said that, the kink between 5 and 7 suggests that this range should also be tested for k.

#If we're going to run tests on the k-means against the data we need to remove the outliers from our quality column too
wine.q_cleansed <- wine.q[-iqr_outliers]

fit.km2 <- kmeans(wine.scaled_cleansed_iqr, 2)
fit.km3 <- kmeans(wine.scaled_cleansed_iqr, 3)
fit.km4 <- kmeans(wine.scaled_cleansed_iqr, 4)
fit.km5 <- kmeans(wine.scaled_cleansed_iqr, 5)
fit.km6 <- kmeans(wine.scaled_cleansed_iqr, 6)
fit.km7 <- kmeans(wine.scaled_cleansed_iqr, 7)
fit.km11 <- kmeans(wine.scaled_cleansed_iqr, 11)
fit.km14 <- kmeans(wine.scaled_cleansed_iqr, 14)

plotcluster(wine.scaled_cleansed_iqr, fit.km2$cluster)

plotcluster(wine.scaled_cleansed_iqr, fit.km3$cluster)

plotcluster(wine.scaled_cleansed_iqr, fit.km4$cluster)

plotcluster(wine.scaled_cleansed_iqr, fit.km5$cluster)

plotcluster(wine.scaled_cleansed_iqr, fit.km6$cluster)

plotcluster(wine.scaled_cleansed_iqr, fit.km7$cluster)

plotcluster(wine.scaled_cleansed_iqr, fit.km11$cluster)

plotcluster(wine.scaled_cleansed_iqr, fit.km14$cluster)

Mapping/Fitting the clusters to the data

Now that we have experimented with various values of k when applying k-means clustering to the wine data, it's time to see whether any fit delivers something obviously meaningful with regard to wine quality. While the strongest cluster option appears to be 2, it is worth trying to map the clusters against quality by treating the quality values as factors; as the following table shows, only 7 of the 10 possible quality scores actually occur, so the most natural fit is against 7 clusters, one per quality value.

To compare the clusters to the quality scores we use a confusion matrix to see where the values lie within those two sets of data. To evaluate the confusion matrix mathematically, we will apply the Rand index, which Wikipedia (https://en.wikipedia.org/wiki/Cluster_analysis#External_evaluation) describes as follows:

The Rand index computes how similar the clusters (returned by the clustering algorithm) are to the benchmark classifications. One can also view the Rand index as a measure of the percentage of correct decisions made by the algorithm. It can be computed using the following formula:

RI = (TP + TN) / (TP + FP + FN + TN)

where TP is the number of true positive pairs, TN the number of true negative pairs, FP the number of false positive pairs and FN the number of false negative pairs.
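For intuition, the formula can be sketched directly in R. Note this is an illustrative implementation of the plain Rand index only; the randIndex function from the flexclust package used below returns the adjusted Rand index (ARI), which corrects for chance agreement and yields lower values.

```r
#Illustrative plain Rand index: fraction of pairs on which the ground
#truth and the clustering agree (both together or both apart)
rand_index <- function(truth, clusters) {
  same_truth   <- outer(truth, truth, "==")
  same_cluster <- outer(clusters, clusters, "==")
  agree <- same_truth == same_cluster   #TP and TN pairs
  idx <- upper.tri(agree)               #count each unordered pair once
  sum(agree[idx]) / sum(idx)
}

rand_index(c(1, 1, 2, 2), c(1, 1, 2, 1))  # 3 of 6 pairs agree -> 0.5
```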

wine.q_table <- table(wine.q_cleansed)
wine.q_table
## wine.q_cleansed
##    3    4    5    6    7    8    9 
##   19  163 1456 2190  880  175    5
barplot(wine.q_table,
       xlab="Quality values",
       ylab="Frequency",
       main="Distribution of wines across quality values")

confuseTable.km7 <- table(wine.q_cleansed, fit.km7$cluster)

names(dimnames(confuseTable.km7)) <- list("Quality", "Clusters")

confuseTable.km7
##        Clusters
## Quality   1   2   3   4   5   6   7
##       3   0   6   6   2   1   2   2
##       4  10  18  48  18   4  19  46
##       5  71 409 278 327  49  51 271
##       6 267 366 396 347  47 327 440
##       7 163  34 142 102   2 324 113
##       8  31   6  22  18   2  76  20
##       9   0   0   1   0   0   4   0
randIndex(confuseTable.km7)
##        ARI 
## 0.03058022

Poor results

This low value of 0.03058022 is far from the ideal of 1 and, as can be seen from the matrix, the quality scores are spread across all clusters. We can surmise that either the White Wine dataset was not cleansed thoroughly enough or that k-means clustering simply isn't an effective way of determining quality.

In order to be sure that it is indeed the methodology that is unsuitable, rather than the data being insufficiently processed, a more severe form of data cleansing may prove insightful; to that end, it is worth removing all boxplot outliers across all variables and running the whole process again to see if the results are more conclusive.

wine.properties <- names(wine.all_but_q)
wine.scaled_cleansed_bp_all <- wine.scaled

wine.scaled_cleansed_bp_all <- wine.scaled_cleansed_bp_all[ !(wine.scaled_cleansed_bp_all$density %in% boxplot(wine.scaled_cleansed_bp_all$density, plot=FALSE)$out), ]
wine.scaled_cleansed_bp_all <- wine.scaled_cleansed_bp_all[ !(wine.scaled_cleansed_bp_all$`free sulfur dioxide` %in% boxplot(wine.scaled_cleansed_bp_all$`free sulfur dioxide`, plot=FALSE)$out), ]
wine.scaled_cleansed_bp_all <- wine.scaled_cleansed_bp_all[ !(wine.scaled_cleansed_bp_all$`residual sugar` %in% boxplot(wine.scaled_cleansed_bp_all$`residual sugar`, plot=FALSE)$out), ]
wine.scaled_cleansed_bp_all <- wine.scaled_cleansed_bp_all[ !(wine.scaled_cleansed_bp_all$`citric acid` %in% boxplot(wine.scaled_cleansed_bp_all$`citric acid`, plot=FALSE)$out), ]
wine.scaled_cleansed_bp_all <- wine.scaled_cleansed_bp_all[ !(wine.scaled_cleansed_bp_all$`fixed acidity` %in% boxplot(wine.scaled_cleansed_bp_all$`fixed acidity`, plot=FALSE)$out), ]
wine.scaled_cleansed_bp_all <- wine.scaled_cleansed_bp_all[ !(wine.scaled_cleansed_bp_all$`volatile acidity` %in% boxplot(wine.scaled_cleansed_bp_all$`volatile acidity`, plot=FALSE)$out), ]
wine.scaled_cleansed_bp_all <- wine.scaled_cleansed_bp_all[ !(wine.scaled_cleansed_bp_all$chlorides %in% boxplot(wine.scaled_cleansed_bp_all$chlorides, plot=FALSE)$out), ]
wine.scaled_cleansed_bp_all <- wine.scaled_cleansed_bp_all[ !(wine.scaled_cleansed_bp_all$`total sulfur dioxide` %in% boxplot(wine.scaled_cleansed_bp_all$`total sulfur dioxide`, plot=FALSE)$out), ]
wine.scaled_cleansed_bp_all <- wine.scaled_cleansed_bp_all[ !(wine.scaled_cleansed_bp_all$pH %in% boxplot(wine.scaled_cleansed_bp_all$pH, plot=FALSE)$out), ]
wine.scaled_cleansed_bp_all <- wine.scaled_cleansed_bp_all[ !(wine.scaled_cleansed_bp_all$sulphates %in% boxplot(wine.scaled_cleansed_bp_all$sulphates, plot=FALSE)$out), ]
wine.scaled_cleansed_bp_all <- wine.scaled_cleansed_bp_all[ !(wine.scaled_cleansed_bp_all$alcohol %in% boxplot(wine.scaled_cleansed_bp_all$alcohol, plot=FALSE)$out), ]

# for (prop in wine.properties) {
#   wine.scaled_cleansed_bp_all <- wine.scaled_cleansed_bp_all[ !(wine.scaled_cleansed_bp_all[prop] %in% boxplot(wine.scaled_cleansed_bp_all[prop], plot=FALSE)$out), ]
# }

boxplot(wine.scaled_cleansed_bp_all, main="Boxplot all outliers cleansed", xlab="Wine Attributes", ylab="Scaled values")

number_of_clusters_severe_cleanse <- NbClust(wine.scaled_cleansed_bp_all,
                min.nc=2, max.nc=15,
                method="kmeans")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 11 proposed 2 as the best number of clusters 
## * 5 proposed 3 as the best number of clusters 
## * 2 proposed 4 as the best number of clusters 
## * 1 proposed 10 as the best number of clusters 
## * 4 proposed 14 as the best number of clusters 
## * 1 proposed 15 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  2 
##  
##  
## *******************************************************************

The following table displays the results recommending potential values for k

table(number_of_clusters_severe_cleanse$Best.n[1,])
## 
##  0  2  3  4 10 14 15 
##  2 11  5  2  1  4  1

The bar chart more easily conveys this.

barplot(table(number_of_clusters_severe_cleanse$Best.n[1,]), 
       xlab="Number of Clusters",
       ylab="Number of Criteria",
       main="Number of Clusters Chosen by 30 Criteria")

sse_list <- 0
for (i in 1:15){
 sse_list[i] <- sum(kmeans(wine.scaled_cleansed_bp_all, centers=i)$withinss)
}

plot(1:15,
 sse_list,
 type="b",
 xlab="Number of Clusters",
 ylab="Within groups sum of squares")

#If we're going to run tests on the k-means against the severely cleansed data we need to remove the outliers from our quality column too
bp_severe_outliers <- unique(unlist(mapply(function(x, y) sapply(setdiff(x, y), function(d) which(x==d)), wine.scaled, wine.scaled_cleansed_bp_all)))

wine.q_cleansed_severe <- wine.q[-bp_severe_outliers]

fit_severe.km2 <- kmeans(wine.scaled_cleansed_bp_all, 2)
fit_severe.km3 <- kmeans(wine.scaled_cleansed_bp_all, 3)
fit_severe.km4 <- kmeans(wine.scaled_cleansed_bp_all, 4)
fit_severe.km5 <- kmeans(wine.scaled_cleansed_bp_all, 5)
fit_severe.km6 <- kmeans(wine.scaled_cleansed_bp_all, 6)
fit_severe.km7 <- kmeans(wine.scaled_cleansed_bp_all, 7)
fit_severe.km11 <- kmeans(wine.scaled_cleansed_bp_all, 11)
fit_severe.km14 <- kmeans(wine.scaled_cleansed_bp_all, 14)

plotcluster(wine.scaled_cleansed_bp_all, fit_severe.km2$cluster)

plotcluster(wine.scaled_cleansed_bp_all, fit_severe.km3$cluster)

plotcluster(wine.scaled_cleansed_bp_all, fit_severe.km4$cluster)

plotcluster(wine.scaled_cleansed_bp_all, fit_severe.km5$cluster)

plotcluster(wine.scaled_cleansed_bp_all, fit_severe.km6$cluster)

plotcluster(wine.scaled_cleansed_bp_all, fit_severe.km7$cluster)

plotcluster(wine.scaled_cleansed_bp_all, fit_severe.km11$cluster)

plotcluster(wine.scaled_cleansed_bp_all, fit_severe.km14$cluster)

confuseTable_severe.km7 <- table(wine.q_cleansed_severe, fit_severe.km7$cluster)

names(dimnames(confuseTable_severe.km7)) <- list("Quality", "Clusters")

confuseTable_severe.km7
##        Clusters
## Quality   1   2   3   4   5   6   7
##       3   1   4   0   1   2   1   0
##       4  10  32   2   9   8  23   4
##       5 193 180  59 372  37 196  64
##       6 239 268 186 329 264 338 240
##       7  77  84 116  40 241  96 141
##       8  14   8  38   7  40  10  28
##       9   0   0   0   0   4   0   0
randIndex(confuseTable_severe.km7)
##       ARI 
## 0.0313044

Results so far

Because k-means starts from random centroids, exact ARI values shift slightly between runs; in the run shown, the lightly 'pruned' dataset scored 0.03058022 and the severely cleansed dataset 0.0313044. The two are essentially indistinguishable, and both are so close to 0 that neither cleansing strategy produces clusters that act as useful predictors of Quality.

Alternative clusters

Since the more severe cleansing brought no real improvement, we continue with the lightly cleansed dataset. To refresh what its confusion matrix looked like, it is repeated below:

confuseTable.km7
##        Clusters
## Quality   1   2   3   4   5   6   7
##       3   0   6   6   2   1   2   2
##       4  10  18  48  18   4  19  46
##       5  71 409 278 327  49  51 271
##       6 267 366 396 347  47 327 440
##       7 163  34 142 102   2 324 113
##       8  31   6  22  18   2  76  20
##       9   0   0   1   0   0   4   0
randIndex(confuseTable.km7)
##        ARI 
## 0.03058022

If we look at this more closely, we can see that while wines of various qualities are spread across all clusters, some clusters are weighted in favour of higher-quality wines or of the middle range. From this observation it can be posited that a more meaningful fit might be found between coarser quality factors: "Good", "Mediocre" and "Bad". So one final experiment before drawing a conclusion is to fit the data against 3 clusters.

Creating 3 quality factors & attempting one last fit.

wine.q_cleansed_f3 <- cut(wine.q_cleansed, 3, labels = c("bad", "mediocre", "good"))

confuseTable.km3 <- table(wine.q_cleansed, fit.km3$cluster)
names(dimnames(confuseTable.km3)) <- list("Quality", "Clusters")

confuseTable.km3_f3 <- table(wine.q_cleansed_f3, fit.km3$cluster)
names(dimnames(confuseTable.km3_f3)) <- list("Quality", "Clusters")

confuseTable.km3
##        Clusters
## Quality   1   2   3
##       3   2  10   7
##       4  45  42  76
##       5 282 786 388
##       6 731 789 670
##       7 462 141 277
##       8  90  27  58
##       9   3   0   2
randIndex(confuseTable.km3)
##       ARI 
## 0.0349165
confuseTable.km3_f3
##           Clusters
## Quality       1    2    3
##   bad       329  838  471
##   mediocre 1193  930  947
##   good       93   27   60
randIndex(confuseTable.km3_f3)
##        ARI 
## 0.02601754

These further attempts to fit the data against 3 clusters have not yielded (significantly) better results. The very last thing is to see how the number suggested by NbClust works out.

Fitting to NbClust suggested k = 2

wine.q_cleansed_f2 <- cut(wine.q_cleansed, 2, labels = c("bad", "good"))

confuseTable.km2 <- table(wine.q_cleansed, fit.km2$cluster)
names(dimnames(confuseTable.km2)) <- list("Quality", "Clusters")

confuseTable.km2_f2 <- table(wine.q_cleansed_f2, fit.km3$cluster)
names(dimnames(confuseTable.km2_f2)) <- list("Quality", "Clusters")

confuseTable.km2
##        Clusters
## Quality    1    2
##       3    8   11
##       4  109   54
##       5  608  848
##       6 1329  861
##       7  725  155
##       8  147   28
##       9    4    1
randIndex(confuseTable.km2)
##        ARI 
## 0.02553242
confuseTable.km2_f2
##        Clusters
## Quality    1    2    3
##    bad  1060 1627 1141
##    good  555  168  337
randIndex(confuseTable.km2_f2)
##       ARI 
## 0.0346663

Writing up for the best results

According to the ARI values, the highest (0.0349165 in the run shown) was achieved by 3 clusters compared against the 7 unique quality values, making this the most successful k-means clustering of those explored, though only marginally. So before we reach our conclusion it's worth displaying the characteristics of this particular set of clusters for k = 3.

#fit.km3

#K-means clustering with 3 clusters of sizes:
fit.km3$size
## [1] 1615 1795 1478
#Cluster means:
fit.km3$centers
##   fixed acidity volatile acidity citric acid residual sugar  chlorides
## 1    -0.7518320      -0.03341504 -0.36556862     -0.5949961 -0.2784045
## 2     0.1221039       0.03463949  0.22136042      0.9244884  0.4093015
## 3     0.6629243      -0.01252144  0.09769788     -0.4847543 -0.1947752
##   free sulfur dioxide total sulfur dioxide    density         pH
## 1          -0.2110590           -0.3806105 -0.6686403  0.7951063
## 2           0.6373365            0.7925332  0.9958298 -0.2184079
## 3          -0.5539881           -0.5553075 -0.4902981 -0.6038568
##     sulphates    alcohol
## 1  0.23936045  0.5421105
## 2  0.06211615 -0.8403982
## 3 -0.33752884  0.4237459

Conclusion

By the looks of it, this use of k-means is simply not appropriate for this dataset; it would seem that either Principal Component Analysis or some method of being selective about which variables to cluster on would be required to reduce the noise that comes from having so many dimensions. Exploratory Data Analysis could also have proved beneficial here: finding relationships between the individual variables and Quality might have led to a better understanding of the factors affecting wine quality, resulting in some idea of how to select only certain variables to apply k-means to. Additionally, I would consider initially testing against a smaller sample of data next time before investing so many CPU cycles in this task!
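To illustrate the PCA route suggested above, a minimal sketch follows (not run as part of this analysis; the choice of 3 components is arbitrary, for illustration only):

```r
#PCA on the cleansed, scaled attributes; data is already centred/scaled
wine.pca <- prcomp(wine.scaled_cleansed_iqr)
summary(wine.pca)                    #proportion of variance per component

#cluster on a reduced set of components to cut dimensional noise
wine.pc3 <- wine.pca$x[, 1:3]        #keep, say, the first 3 components
fit.pca_km3 <- kmeans(wine.pc3, 3)
plotcluster(wine.pc3, fit.pca_km3$cluster)
```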

Question 2: White Wine clustering (Hierarchical)

Premise

You need to conduct the hierarchical (agglomerative) clustering analysis of the white wine sheet. Investigate the hclust() function for single, complete, average methods. Create the visualization of all methods using a dendrogram. Look at the cophenetic correlation between each clustering result using cor.dendlist. Discuss the produced results after using the corrplot function. Write a code in R Studio to address all the above issues.

Preparation of data

As with Question 1, the data needs to be loaded, partitioned, then scaled. Assuming that this has all been done as before - there's no need to repeat the exact same code - we can simply show that our data is ready to work with.

Important Pre-processing issue

In carrying out the experiments with creating the 3 clusterings and then comparing them, I hit an issue with the R runtime due to the sheer size of the dataset and the complexity of the hierarchical clusters built from it; the console reported a node stack overflow, and after some research online the only option was to use a smaller dataset. The sample function was used to reduce the dataset under study to approximately two-thirds of its original size. Unfortunately, this may well have an adverse impact on results, but there's not much to be done about that apart from perhaps re-running the clustering (multiple times) with fresh samples of the same size to see how much the results vary.
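A sampling step along these lines was used (the seed and the exact fraction shown here are illustrative, not the values from the original run):

```r
#reduce the dataset to roughly two-thirds of its rows to avoid the
#node stack overflow in hclust(); seed fixed for reproducibility
set.seed(123)
sample_idx  <- sample(nrow(wine.all_but_q),
                      size = floor(nrow(wine.all_but_q) * 2/3))
wine.scaled <- as.data.frame(scale(wine.all_but_q[sample_idx, ]))
```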

That issue aside, what follows is a summary of the sampled and scaled data.

summary(wine.scaled)
##  fixed acidity      volatile acidity   citric acid      residual sugar   
##  Min.   :-3.60000   Min.   :-1.9774   Min.   :-2.7479   Min.   :-1.1431  
##  1st Qu.:-0.66330   1st Qu.:-0.6880   1st Qu.:-0.5219   1st Qu.:-0.9285  
##  Median :-0.07596   Median :-0.1921   Median :-0.1921   Median :-0.2261  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.51138   3rd Qu.: 0.4030   3rd Qu.: 0.3850   3rd Qu.: 0.6715  
##  Max.   : 8.61666   Max.   : 8.1396   Max.   :10.9377   Max.   :11.5783  
##    chlorides       free sulfur dioxide total sulfur dioxide
##  Min.   :-1.5412   Min.   :-1.91571    Min.   :-3.0229     
##  1st Qu.:-0.4431   1st Qu.:-0.71125    1st Qu.:-0.6913     
##  Median :-0.1229   Median :-0.08034    Median :-0.1084     
##  Mean   : 0.0000   Mean   : 0.00000    Mean   : 0.0000     
##  3rd Qu.: 0.1974   3rd Qu.: 0.60792    3rd Qu.: 0.6669     
##  Max.   :13.7394   Max.   :14.54522    Max.   : 7.0264     
##     density               pH            sulphates          alcohol        
##  Min.   :-2.30592   Min.   :-3.1229   Min.   :-2.0906   Min.   :-2.04129  
##  1st Qu.:-0.75613   1st Qu.:-0.6503   1st Qu.:-0.6999   1st Qu.:-0.81951  
##  Median :-0.08577   Median :-0.0488   Median :-0.1785   Median :-0.08645  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000  
##  3rd Qu.: 0.68415   3rd Qu.: 0.6195   3rd Qu.: 0.5169   3rd Qu.: 0.72807  
##  Max.   :14.90774   Max.   : 4.0946   Max.   : 5.1233   Max.   : 2.84581

Hierarchical clustering

Hierarchical clustering methods take a distance matrix as input; to that end, we need to transform the scaled data into a distance matrix before passing it to our hierarchical clustering function.

wine.dist_matrix_euclidean <- dist(wine.scaled) # Euclidean distance matrix.
summary(wine.dist_matrix_euclidean)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   3.378   4.229   4.420   5.206  27.360

Now we can start clustering the data using the hclust function. As per the requirements of this question we call hclust three times, to evaluate the “single”, “complete” and “average” methods for hierarchical clustering. Assuming this hierarchical clustering is meant to find relationships between the quantitative wine properties and wine quality, and knowing from the previous question that there are only 7 unique values for Quality across the entire dataset, the next step is to cut our cluster trees down to 7 branches.

wine.hclust_single <- hclust(wine.dist_matrix_euclidean, method = "single")
wine.hclust_complete <- hclust(wine.dist_matrix_euclidean, method = "complete")
wine.hclust_average <- hclust(wine.dist_matrix_euclidean, method = "average")

#Display dendrograms
plot(wine.hclust_single, main="Hierarchical clustering White Wine", sub = "Single method", labels=FALSE)
rect.hclust(wine.hclust_single, k=7, border="green") 

plot(wine.hclust_complete, main="Hierarchical clustering White Wine", sub = "Complete method", labels=FALSE)
rect.hclust(wine.hclust_complete, k=7, border="green") 

plot(wine.hclust_average, main="Hierarchical clustering White Wine", sub = "Average method", labels=FALSE)
rect.hclust(wine.hclust_average, k=7, border="green") 

wine.hclust_single_g7 <- cutree(wine.hclust_single, k=7)
table(wine.hclust_single_g7)
## wine.hclust_single_g7
##    1    2    3    4    5    6    7 
## 3225    2    1    1    1    1    1
wine.hclust_complete_g7 <- cutree(wine.hclust_complete, k=7)
table(wine.hclust_complete_g7)
## wine.hclust_complete_g7
##    1    2    3    4    5    6    7 
## 3094  123    7    1    5    1    1
wine.hclust_average_g7 <- cutree(wine.hclust_average, k=7)
table(wine.hclust_average_g7)
## wine.hclust_average_g7
##    1    2    3    4    5    6    7 
## 3212    8    8    1    1    1    1

Interpreting the data

My observation from the cut trees is that, unless the data is especially skewed, the Complete method is the only one of the three tried with any hope of clustering in a way that correlates, albeit weakly, with Quality. That said, this is not the main task of the question, which is to compare the dendrograms against one another to see how similar they are. To do this we convert the cluster models to the dendrogram data type before using cor.dendlist to cross-compare the trees.

d_list <- dendlist(
  "Single" = wine.hclust_single %>% as.dendrogram,
  "Complete" = wine.hclust_complete %>% as.dendrogram,
  "Average" = wine.hclust_average %>% as.dendrogram
)

cophentic_coefficient <- cor.dendlist(d_list, "cophenetic")

Below is a matrix where the dendrograms are compared on a scale of 0 to 1, where 1 is 100% parity; below that is a diagram expressing the same data using pie charts. From the share of the pies, we can see that the “single” (1) and “average” (3) methods deliver trees that are more similar to each other than either is to “complete”.

# Print correlation matrix
round(cophentic_coefficient, 2)
##          Single Complete Average
## Single     1.00     0.35    0.74
## Complete   0.35     1.00    0.45
## Average    0.74     0.45    1.00
corrplot(cophentic_coefficient, "pie", "lower")

While we can see from corrplot, or even the corresponding table, that the “Single” and “Average” dendrograms are 74% alike, this doesn’t mean they are better! We can run the hierarchical clustering through a confusion matrix, as we did with k-means, and find that the results do not give us any meaningful relationship to wine Quality. In fact, every clustering method used has ended up with a massive bias towards a single cluster to which most observations belong, across all qualities of wine!

hclust_single_g7_table <- table(wine.q, wine.hclust_single_g7)
names(dimnames(hclust_single_g7_table)) <- list("Quality", "Clusters")
hclust_single_g7_table
##        Clusters
## Quality    1    2    3    4    5    6    7
##       3   13    0    0    0    1    0    0
##       4  117    0    0    0    0    0    0
##       5  961    0    1    0    0    0    0
##       6 1437    2    0    1    0    1    1
##       7  584    0    0    0    0    0    0
##       8  112    0    0    0    0    0    0
##       9    1    0    0    0    0    0    0
randIndex(hclust_single_g7_table)
##           ARI 
## -0.0004977646
hclust_complete_g7_table <- table(wine.q, wine.hclust_complete_g7)
names(dimnames(hclust_complete_g7_table)) <- list("Quality", "Clusters")
hclust_complete_g7_table
##        Clusters
## Quality    1    2    3    4    5    6    7
##       3   12    0    1    0    0    1    0
##       4   88   28    1    0    0    0    0
##       5  902   54    5    0    1    0    0
##       6 1406   30    0    1    4    0    1
##       7  576    8    0    0    0    0    0
##       8  109    3    0    0    0    0    0
##       9    1    0    0    0    0    0    0
randIndex(hclust_complete_g7_table)
##        ARI 
## 0.01511415
hclust_average_g7_table <- table(wine.q, wine.hclust_average_g7)
names(dimnames(hclust_average_g7_table)) <- list("Quality", "Clusters")
hclust_average_g7_table
##        Clusters
## Quality    1    2    3    4    5    6    7
##       3   12    0    1    0    1    0    0
##       4  114    2    1    0    0    0    0
##       5  952    4    6    0    0    0    0
##       6 1438    1    0    1    0    1    1
##       7  583    1    0    0    0    0    0
##       8  112    0    0    0    0    0    0
##       9    1    0    0    0    0    0    0
randIndex(hclust_average_g7_table)
##         ARI 
## 0.002569654

Trying again with cleansed data

Just in case outliers have skewed the outcomes, I’m going to re-run all the cluster calls with data that has had outliers removed, using the boxplot method.

wine.scaled_cleansed_bp_all <- wine.scaled

# Remove boxplot outliers column by column, in the same order as before
for (col in c("density", "free sulfur dioxide", "residual sugar",
              "citric acid", "fixed acidity", "volatile acidity",
              "chlorides", "total sulfur dioxide", "pH",
              "sulphates", "alcohol")) {
  outliers <- boxplot(wine.scaled_cleansed_bp_all[[col]], plot = FALSE)$out
  wine.scaled_cleansed_bp_all <-
    wine.scaled_cleansed_bp_all[!(wine.scaled_cleansed_bp_all[[col]] %in% outliers), ]
}

wine_cleansed.dist_matrix_euclidean <- dist(wine.scaled_cleansed_bp_all) # Euclidean distance matrix.
summary(wine_cleansed.dist_matrix_euclidean)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   3.095   3.798   3.869   4.577   9.623

As before, we call hclust three times on the cleansed data to evaluate the “single”, “complete” and “average” methods, and again cut the resulting trees down to 7 branches to match the 7 unique Quality values.

wine_cleansed.hclust_single <- hclust(wine_cleansed.dist_matrix_euclidean, method = "single")
wine_cleansed.hclust_complete <- hclust(wine_cleansed.dist_matrix_euclidean, method = "complete")
wine_cleansed.hclust_average <- hclust(wine_cleansed.dist_matrix_euclidean, method = "average")

#Display dendrograms
plot(wine_cleansed.hclust_single, main="Hierarchical clustering White Wine", sub = "Single method", labels=FALSE)
rect.hclust(wine_cleansed.hclust_single, k=7, border="green") 

plot(wine_cleansed.hclust_complete, main="Hierarchical clustering White Wine", sub = "Complete method", labels=FALSE)
rect.hclust(wine_cleansed.hclust_complete, k=7, border="green") 

plot(wine_cleansed.hclust_average, main="Hierarchical clustering White Wine", sub = "Average method", labels=FALSE)
rect.hclust(wine_cleansed.hclust_average, k=7, border="green") 

wine_cleansed.hclust_single_g7 <- cutree(wine.hclust_single, k=7)
table(wine_cleansed.hclust_single_g7)
## wine_cleansed.hclust_single_g7
##    1    2    3    4    5    6    7 
## 3225    2    1    1    1    1    1
wine_cleansed.hclust_complete_g7 <- cutree(wine.hclust_complete, k=7)
table(wine_cleansed.hclust_complete_g7)
## wine_cleansed.hclust_complete_g7
##    1    2    3    4    5    6    7 
## 3094  123    7    1    5    1    1
wine_cleansed.hclust_average_g7 <- cutree(wine.hclust_average, k=7)
table(wine_cleansed.hclust_average_g7)
## wine_cleansed.hclust_average_g7
##    1    2    3    4    5    6    7 
## 3212    8    8    1    1    1    1

Interestingly, it’s already apparent from the dendrograms that the “single” and “average” trees are less similar with this set of data; the “average” result now sits somewhere in between “single” and “complete”, while the “single” outcome remains largely unchanged. As such, I think it merits the expense of plotting the diagrams to visually display the statistical comparison.

cleansed_d_list <- dendlist(
  "Single" = wine_cleansed.hclust_single %>% as.dendrogram,
  "Complete" = wine_cleansed.hclust_complete %>% as.dendrogram,
  "Average" = wine_cleansed.hclust_average %>% as.dendrogram
)

cleansed_cophentic_coefficient <- cor.dendlist(cleansed_d_list, "cophenetic")
# Print correlation matrix
round(cleansed_cophentic_coefficient, 2)
##          Single Complete Average
## Single     1.00     0.03    0.21
## Complete   0.03     1.00    0.61
## Average    0.21     0.61    1.00
corrplot(cleansed_cophentic_coefficient, "pie", "lower")

The results above suggest that removing outliers makes a massive difference to how these methods behave; this must be considered in particular alongside my choice of keeping the Euclidean distance metric.
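If the distance metric itself were in question, swapping it is a one-line change. The following is a purely illustrative sketch of building a Manhattan distance matrix instead, assuming wine.scaled_cleansed_bp_all from above; it was not run as part of this analysis.

```r
# Hypothetical alternative: Manhattan (city-block) distance instead of Euclidean.
# dist() supports several metrics via its method argument.
wine_cleansed.dist_matrix_manhattan <-
  dist(wine.scaled_cleansed_bp_all, method = "manhattan")

# The same hclust / cor.dendlist comparison could then be repeated on this matrix
wine_cleansed.hclust_complete_man <-
  hclust(wine_cleansed.dist_matrix_manhattan, method = "complete")
```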

wine_cleansed.hclust_complete_g7_table <- table(wine.q, wine_cleansed.hclust_complete_g7)
names(dimnames(wine_cleansed.hclust_complete_g7_table)) <- list("Quality", "Clusters")
wine_cleansed.hclust_complete_g7_table
##        Clusters
## Quality    1    2    3    4    5    6    7
##       3   12    0    1    0    0    1    0
##       4   88   28    1    0    0    0    0
##       5  902   54    5    0    1    0    0
##       6 1406   30    0    1    4    0    1
##       7  576    8    0    0    0    0    0
##       8  109    3    0    0    0    0    0
##       9    1    0    0    0    0    0    0
randIndex(wine_cleansed.hclust_complete_g7_table)
##        ARI 
## 0.01511415

Conclusion

Using hierarchical clustering over this dataset has not proved very insightful; in fact, less correlation was found between Quality and cluster than with the k-means method. Also, the size of the dataset combined with the complexity of agglomerative clustering, O(n^2 log n), makes the method unsuitable for the original size of the dataset. Even though the hierarchical clustering results have shown no obvious relationship to wine Quality, seeing the dendrograms and the corrplot output, if I were to use this method of clustering again (assuming a Euclidean distance matrix) I would now have preferred and less-favoured methods. The “single” method, which merges groups by the nearest pair of points between them, would be at the bottom of my list; I think I would pick “Complete” (merging based on maximum pairwise distance) over “Average”, though more research would be needed to make a properly informed decision beyond my own findings.

Question 3: MLP neural network forecasting

Premise

You need to construct an MLP neural network for this problem. You need to consider the appropriate input vector, as well as the internal network structure (hidden layers, nodes, learning rate). You may consider any de-trending scheme if you feel it is necessary. Write a code in R Studio to address all these requirements. You need to show the performance of your network both graphically and in terms of the usual statistical indices (MSE, RMSE and MAPE). Hint: Experiment with various network structures and show a comparison table of their performances. This will be a good justification for your final network choice. Show all your working steps. As everyone will have different forecasting results, emphasis in the marking scheme will be given to the adopted methodology and the explanation/justification of the various decisions you have taken in order to provide an acceptable, in terms of performance, solution. The input selection problem is very important. Experiment with various options (i.e. how many past values you need to consider as potential network inputs).

Preparation of data

As with Question 1, the data needs to be loaded, partitioned, then scaled. Since this has all been done as before - there's no need to demonstrate the exact same code again - we can simply show that our data is ready to work with.
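The forecasting data itself is not shown in this extract, so the following is only a hedged sketch of the intended approach, assuming the neuralnet package. The data frame ts.data, its lagged input columns lag1 to lag3, the target column target, and the split points are all hypothetical names chosen for illustration.

```r
# Hypothetical sketch of an MLP experiment with the neuralnet package.
# ts.data, lag1..lag3 and target are illustrative names; in practice they
# would be built by shifting the series back 1, 2 and 3 steps.
library(neuralnet)

train <- ts.data[1:400, ]       # illustrative train/test split
test  <- ts.data[401:500, ]

nn <- neuralnet(target ~ lag1 + lag2 + lag3,
                data = train,
                hidden = c(5),          # one hidden layer of 5 nodes
                linear.output = TRUE)   # regression output, not classification

pred <- compute(nn, test[, c("lag1", "lag2", "lag3")])$net.result

# The usual statistical indices
mse  <- mean((test$target - pred)^2)
rmse <- sqrt(mse)
mape <- mean(abs((test$target - pred) / test$target)) * 100
```

Different structures (e.g. hidden = c(5), c(10) or c(5, 3)) and different numbers of lagged inputs would then be compared in a table of these indices, as the question requires.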

Conclusion